Linear regression is a useful tool for predicting a quantitative response
It serves as a good jumping-off point for newer approaches because many more sophisticated statistical learning methods can be seen as generalizations or extensions of linear regression
Consider the advertising data shown above
Questions that we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?
How could we answer these questions with linear regression?
A very straightforward approach for predicting a quantitative response \(Y\) on the basis of a single predictor variable \(X\)
The primary assumption that we make is that there is an approximately linear relationship between \(X\) and \(Y\)
We assume a model: \(Y = B_0 + B_1X + \epsilon\)
where \(B_0\) and \(B_1\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term
We may describe the above equation by saying that we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\))
Once we have utilized our training data to produce estimates \(\hat{B_0}\) and \(\hat{B_1}\) for the model coefficients, we could predict future sales using: \(\hat{y} = \hat{B_0} + \hat{B_1}x\)
where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\)
Our goal is to obtain coefficient estimates \(\hat{B_0}\) and \(\hat{B_1}\) such that our linear model fits the available data well
We want to find an intercept \(\hat{B_0}\) and a slope \(\hat{B_1}\) such that the resulting line is as close as possible to the \(n\) data points
There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion
Let \(\hat{y_i} = \hat{B_0} + \hat{B_1}x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\)
Then \(e_i = y_i - \hat{y_i}\) represents the \(i\)th residual
We define the residual sum of squares (RSS) as: \(RSS = e_1^2 + e_2^2 + ... + e_n^2\)
The least squares approach chooses \(\hat{B_0}\) and \(\hat{B_1}\) to minimize the RSS
Let's see this at work on the advertising data
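As a minimal sketch (assuming Advertising.csv is in the working directory, as in the chunk further below; the object names adv, b0_hat, and b1_hat are just for illustration), the least squares estimates for a regression of sales on TV can be computed directly and checked against lm():
# sketch: closed-form least squares estimates for simple linear regression
library(readr)
adv <- read_csv("Advertising.csv")

x <- adv$TV
y <- adv$sales

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0_hat <- mean(y) - b1_hat * mean(x)                                 # intercept
c(b0_hat, b1_hat)

coef(lm(sales ~ TV, data = adv))  # should match the hand-computed estimates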
Recall that we assume the true relationship between \(X\) and \(Y\) takes the form \(Y = f(X) + \epsilon\); if \(f\) is approximated by a linear function, then we can write the relationship as \(Y = B_0 + B_1X + \epsilon\)
Where \(B_0\) is the intercept term
And \(B_1\) is the slope
\(\epsilon\) is a catch-all for what we miss with our model
\(Y = B_0 + B_1X + \epsilon\) defines the population regression line, which is the best linear approximation to the true relationship between \(X\) and \(Y\)
The least squares regression coefficient estimates characterize the least square line
Bias
On the basis of one particular set of observations \(y_1, ..., y_n\), an estimate might overestimate the parameter of interest, while on the basis of another set of observations it might underestimate it
An unbiased estimator does not systematically over- or underestimate the true parameter
The property of unbiasedness holds for the least squares coefficient estimates (Best Linear Unbiased Estimate or BLUE)
One potential thought you might have concerns the accuracy of our sample mean \(\hat{\mu}\) as an estimate of \(\mu\)
We have established that the average of \(\hat{\mu}\)'s over many data sets will be very close to \(\mu\), but that a single estimate \(\hat{\mu}\) may be a substantial underestimate or overestimate of \(\mu\). How far off will that single estimate \(\hat{\mu}\) be?
Standard Error
The standard error of an estimator reflects how it varies under repeated sampling
The standard error of \(\hat{\mu}\) can be written as \(SE(\hat{\mu})\)
\(Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\)
where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\)
Notice how the standard error shrinks with \(n\): the more observations we have, the smaller the standard error
The standard error tells us the average amount that our estimate \(\hat{\mu}\) differs from the actual value of \(\mu\)
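A quick simulation illustrates this (a sketch with arbitrary choices of \(\mu = 5\), \(\sigma = 2\), and \(n = 100\); the object names are just for illustration):
# sketch: standard error of the sample mean under repeated sampling
set.seed(1)
mu <- 5; sigma <- 2; n <- 100

# draw many data sets of size n and compute the sample mean of each
mu_hats <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

sd(mu_hats)        # empirical standard deviation of mu_hat across samples
sigma / sqrt(n)    # theoretical SE(mu_hat) = sigma / sqrt(n) = 0.2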
These standard errors can be used to compute confidence intervals.
Confidence Intervals
A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. It has the form: \(\hat{B_1} \pm 2 \cdot SE(\hat{B_1})\)
That is, there is approximately a 95% chance that the interval \([\hat{B_1} - 2 \cdot SE(\hat{B_1}), \hat{B_1} + 2 \cdot SE(\hat{B_1})]\) will contain the true value of \(B_1\)
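As a sketch on the advertising data (assuming Advertising.csv is available; SLR is just an illustrative object name), confint() returns these intervals, and the rough \(\pm 2 \cdot SE\) interval can be checked by hand:
# sketch: 95% confidence intervals for the coefficients of sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

confint(SLR, level = 0.95)

# approximate interval by hand: estimate +/- 2 * standard error
est <- coef(summary(SLR))["TV", "Estimate"]
se  <- coef(summary(SLR))["TV", "Std. Error"]
est + c(-2, 2) * se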
Standard errors can also be used to perform hypothesis tests on the coefficients
The most common hypothesis test involves testing the null hypothesis of
\(H_0\): There is no relationship between \(X\) and \(Y\)
\(H_A\): There is some relationship between \(X\) and \(Y\)
\(H_0\): \(B_1\) = 0
\(H_A\): \(B_1\) \(\neq\) 0
To test the null hypothesis, we compute a t-statistic given by: \(t = \frac{\hat{B_1}-0}{SE(\hat{B_1})}\)
This will have a \(t\)-distribution with n - 2 degrees of freedom (assuming \(B_1\) = 0)
Using statistical software (e.g., R), it is easy to compute the probability of observing any value equal to \(|t|\) or larger
We call this probability the p-value
We interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response
Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response
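As a sketch (again on the advertising data, with illustrative object names), the t-statistic and p-value reported by summary() can be reproduced directly:
# sketch: t-statistic and two-sided p-value for H0: B1 = 0 in sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

est <- coef(summary(SLR))["TV", "Estimate"]
se  <- coef(summary(SLR))["TV", "Std. Error"]
t_stat <- (est - 0) / se

n <- nrow(adv)
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # p-value from a t-distribution with n - 2 df

coef(summary(SLR))["TV", c("t value", "Pr(>|t|)")]   # compare with the summary() output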
In order to determine the accuracy of our model, we can compute either the residual standard error or \(R^2\)
Residual Standard Error (RSE)
The RSE is an estimate of the standard deviation of \(\epsilon\). It is the average amount that the response will deviate from the true regression line
\(RSE = \sqrt{\frac{1}{n-2}RSS}=\sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y_i})^2}\)
The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in units of \(Y\), it is not always clear what constitutes a good RSE
\(R^2\) (R-Squared Statistic)
Provides an alternative measure of fit
Takes the form of a proportion - the proportion of variance explained - which means it has a value between 0 and 1 and is independent of the scale of Y
\(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)
Where \(TSS = \sum(y_i - \bar{y})^2\) is the total sum of squares
And \(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\)
Therefore, TSS - RSS measures the amount of variability in the response that is explained by performing the regression
\(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\)
An \(R^2\) statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. Alternatively, a number near 0 indicates that the regression does not explain much of the variability in the response
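As a sketch, both quantities can be computed from the residuals of the simple regression of sales on TV and checked against summary() (illustrative object names again):
# sketch: RSE and R^2 computed by hand for sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

n   <- nrow(adv)
rss <- sum(residuals(SLR)^2)                   # residual sum of squares
tss <- sum((adv$sales - mean(adv$sales))^2)    # total sum of squares

sqrt(rss / (n - 2))   # RSE; compare with summary(SLR)$sigma
1 - rss / tss         # R^2; compare with summary(SLR)$r.squared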
Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable
However, in practice we often have more than one predictor
How can we extend our analysis of the data in order to accommodate two additional predictors?
Let’s extend the simple linear regression model so that it can accommodate multiple predictors
We can achieve this by giving each predictor a separate slope coefficient in a single model
Then the multiple linear regression model takes the form: \(Y = B_0 + B_1X_1 + B_2X_2 + ... + B_pX_p + \epsilon\)
We interpret \(B_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed
As was the case in the simple linear regression setting, the regression coefficients are unknown and must be estimated
Given estimates \(\hat{B_0}, \hat{B_1},...,\hat{B_p}\), we can make predictions using the formula: \(\hat{y} = \hat{B_0} + \hat{B_1}x_1 + \hat{B_2}x_2 + ... + \hat{B_p}x_p\)
The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression, where we choose \(\hat{B_0}, \hat{B_1},...,\hat{B_p}\) to minimize the sum of squared residuals
\(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\)
\(= \sum_{i=1}^{n}(y_i-\hat{B_0}-\hat{B_1}x_{i1}-\hat{B_2}x_{i2}-...-\hat{B_p}x_{ip})^2\)
Results for advertising data
TV, radio, and newspaper advertising budgets are used to predict product sales using the advertising data
Spending an additional $1,000 on TV advertising is associated with approximately 46 units of additional sales, holding all other variables constant
Spending an additional $1,000 on radio advertising is associated with approximately 189 units of additional sales, holding all other variables constant
The relationship between newspaper advertising and sales is not statistically significant
library(tidyverse)
data <- read_csv("Advertising.csv")   # Advertising data: TV, radio, newspaper, sales
# multiple regression of sales on all three media budgets
MLR <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR)
1). Is at least one of the predictors \(X_1, X_2, ... , X_p\), useful in predicting the response?
Recall that in the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether \(B_1\) = 0. In the multiple regression setting with \(p\) predictors, we need to ask whether all of the regression coefficients are zero. We can utilize a hypothesis test to answer this question
\(H_0\): \(B_1 = B_2 = ... = B_p = 0\)
\(H_A\): at least one \(B_j\) is non-zero
We perform this hypothesis test by computing the F-statistic: \(F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}\)
When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1
On the other hand, if \(H_A\) is true, then we expect F to be greater than 1
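As a sketch, the F-statistic can be computed by hand for the multiple regression on all three media and checked against the value printed by summary():
# sketch: overall F-statistic for H0: B1 = B2 = B3 = 0
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

n   <- nrow(adv)
p   <- 3
rss <- sum(residuals(fit)^2)
tss <- sum((adv$sales - mean(adv$sales))^2)

((tss - rss) / p) / (rss / (n - p - 1))   # compare with summary(fit)$fstatistic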
2). Do all the predictors help to explain \(Y\), or is it only a subset of the predictors that are useful?
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection
We can estimate multiple candidate models with the predictors that we have (e.g., a model with no predictors, a model with \(X_1\), a model with \(X_2\), and a model with both \(X_1\) and \(X_2\))
From there we can examine the AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or adjusted \(R^2\) to determine which variables are useful to the model; a short sketch follows the selection strategies listed below
Forward Selection
Backward Selection
Mixed Selection
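As a sketch (one of several reasonable ways to do this), candidate models can be compared with AIC() and BIC(), and base R's step() performs forward selection from the null model:
# sketch: comparing candidate models and a simple forward selection
adv <- readr::read_csv("Advertising.csv")

m0 <- lm(sales ~ 1, data = adv)                       # no predictors
m1 <- lm(sales ~ TV, data = adv)
m2 <- lm(sales ~ TV + radio, data = adv)
m3 <- lm(sales ~ TV + radio + newspaper, data = adv)  # all predictors

AIC(m0, m1, m2, m3)   # lower is better
BIC(m0, m1, m2, m3)

step(m0, scope = formula(m3), direction = "forward")  # forward selection toward the full model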
3). How well does the model fit the data?
Two of the most common numerical measures of model fit are the RSE and \(R^2\)
Recall, \(R^2\) will always increase when more variables are added to the model, even if those variables are only weakly associated with the response
4). Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Once we have fit the multiple regression model, it is straightforward to predict the response \(Y\) on the basis of a set of values for the predictors \(X_1, X_2,...,X_p\). There are three sorts of uncertainty that will be associated with these predictions:
Our least squares estimates are only estimates for the true population
This inaccuracy in the coefficient estimates is related to the reducible error
Of course, in practice assuming a linear model for \(f(X)\) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias. So when we use a linear model, we are in fact estimating the best linear approximation to the true surface. However, here we will ignore this discrepancy, and operate as if the linear model were correct
Even if we knew \(f(X)\), that is, even if we knew the true values of \(B_0, B_1,...,B_p\), the response value cannot be predicted perfectly because of the random error (\(\epsilon\)) in the model. We refer to this as the irreducible error
How much will \(Y\) vary from \(\hat{y}\)?
We use prediction intervals to answer this question
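As a sketch on the advertising data (new_obs is a hypothetical set of budgets chosen only for illustration), predict() returns both types of interval; the prediction interval is wider because it also accounts for the irreducible error:
# sketch: confidence vs. prediction intervals for a new observation
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

new_obs <- data.frame(TV = 100, radio = 20, newspaper = 30)  # hypothetical budgets (in $1,000s)

predict(fit, newdata = new_obs, interval = "confidence")  # uncertainty in the estimated f(X)
predict(fit, newdata = new_obs, interval = "prediction")  # additionally includes epsilon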
Qualitative Predictors
Some predictors are not quantitative but are qualitative, taking a discrete set of values
These are also called categorical predictors or factor variables
Example:
Investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable
\(x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}\)
Which results in the following model:
\(y_i = B_0 + B_1x_i + \epsilon_i\)
For females: \(y_i = B_0 + B_1 + \epsilon_i\)
For males: \(y_i = B_0 + \epsilon_i\)
The average credit card balance for males is estimated to be $509.80
Whereas females are estimated to carry $19.73 in additional credit card balance for a total of $509.80 + $19.73 = $529.53
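A sketch with made-up numbers (this is not the actual Credit data; the toy data frame and its means are invented purely to show the mechanics): when the predictor is a factor, lm() builds the 0/1 dummy variable automatically, the intercept is the mean of the baseline level, and the coefficient is the difference in means:
# sketch: dummy-variable coding with a factor predictor (synthetic data)
set.seed(1)
toy <- data.frame(
  gender  = factor(rep(c("Male", "Female"), each = 50)),
  balance = c(rnorm(50, mean = 510, sd = 100), rnorm(50, mean = 530, sd = 100))
)
toy$gender <- relevel(toy$gender, ref = "Male")   # make Male the baseline level

fit <- lm(balance ~ gender, data = toy)
coef(fit)                  # intercept = mean balance for males; coefficient = female - male difference
model.matrix(fit)[1:3, ]   # the 0/1 coding that lm() used internally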
The standard linear regression model provides interpretable results and works quite well on many real-world problems
However, it makes several highly restrictive assumptions that are often violated in practice
The additivity assumption means that the association between a predictor \(X_j\) and the response \(Y\) does not depend on the values of the other predictors
The linearity assumption states that the change in \(Y\) associated with a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\)
Interactions and Nonlinearity
Interactions:
In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media
For example, the linear model: \(\hat{sales} = B_0 + B_1 \times TV + B_2 \times radio + B_3 \times newspaper\)
states that the average effect on sales of a one-unit increase in TV is always \(B_1\), regardless of the amount spent on radio
But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases
In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio
In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect
# TV*radio expands to TV + radio + TV:radio, i.e., both main effects plus the interaction
interact <- lm(sales ~ TV*radio, data = data)
summary(interact)
The results in this table suggest that interactions are important
The p-value for the interaction term \(TV \times radio\) is extremely low, indicating that there is strong evidence for \(H_A:B_3 \neq 0\)
The \(R^2\) for the interaction model is 96.8% compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term
This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term
The coefficient estimates in the table suggest that an increase in TV advertising of 1,000 dollars is associated with increased sales of \((\hat{B_1}+\hat{B_3}\times radio) \times 1000 = 19 + 1.1 \times radio\) units
The coefficient estimates in the table suggest that an increase in radio advertising of 1,000 dollars is associated with increased sales of \((\hat{B_2}+\hat{B_3}\times TV) \times 1000 = 29 + 1.1 \times TV\) units
Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not
The hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
The rationale for this principle is that interactions are hard to interpret in a model without main effects — their meaning is changed
Specifically, the interaction terms also contain main effects, if the model has no main effect terms
1). Non-linearity of the response-predictor relationships
The linear regression model assumes that there is a straight-line relationship between the predictors and the response
If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect
Residual plots are a useful tool for identifying non-linearity
MLR2 <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR2)

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(MLR2)             # base R residual diagnostics for the fitted model
# ggplot version of the residuals-vs-fitted plot
datamod <- data %>%
  mutate(modfitted = predict(MLR2),
         modresiduals = residuals(MLR2))

ggplot(data = datamod, mapping = aes(x = modfitted, y = modresiduals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(-9, 6, by = 3), limits = c(-9, 6)) +
  theme_bw() +
  theme(plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())
2). Correlation of error terms
An important assumption of the linear regression model is that the error terms are uncorrelated
The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms
If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be
In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant
3). Non-constant variance of error terms
Another important assumption of the linear regression model is that the error terms have a constant variance
The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption
Heteroscedasticity
Will usually present itself as a funnel shape in the residual plot
The variances of the error terms may increase with the value of the response
When faced with this problem, one possible solution is to transform the response \(Y\) using a concave function such as \(\log(Y)\) or \(\sqrt{Y}\)
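A sketch of this remedy on the advertising data (sales are strictly positive, so the log is applicable; whether the transformation is actually needed depends on the residual pattern observed):
# sketch: refit with a log-transformed response and compare the residual plots
adv <- readr::read_csv("Advertising.csv")

fit_raw <- lm(sales ~ TV + radio, data = adv)
fit_log <- lm(log(sales) ~ TV + radio, data = adv)

par(mfrow = c(1, 2))
plot(fitted(fit_raw), residuals(fit_raw), main = "Response: sales")
plot(fitted(fit_log), residuals(fit_log), main = "Response: log(sales)")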
4). Outliers
An outlier is a point for which \(y_i\) is far from the value predicted by the model
Residual plots can be used to identify outliers
5). High-leverage points
Observations with high leverage have an unusual value for \(x_i\)
One way to quantify an observation's leverage is to compute the leverage statistic; a large value of this statistic indicates an observation with high leverage
\(h_i = \frac{1}{n}+\frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}\)
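In R, the leverage statistics for a fitted model are returned by hatvalues(); the threshold used below (three times the average leverage \((p + 1)/n\)) is just a common rule of thumb, not a fixed cutoff:
# sketch: leverage statistics for the multiple regression fit
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

h <- hatvalues(fit)                    # leverage statistic h_i for each observation
avg_leverage <- (3 + 1) / nrow(adv)    # average leverage = (p + 1) / n

which(h > 3 * avg_leverage)            # observations with unusually high leverage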
6). Collinearity
refers to the situation in which two or more predictor variables are closely related to one another
Multicollinearity: collinearity between three or more variables, which can exist even when no single pair of variables is especially highly correlated
Variance Inflation Factor (VIF)
The ratio of the variance of \(\hat{B_j}\) when fitting the full model to the variance of \(\hat{B_j}\) if fit on its own; the smallest possible value is 1, and larger values indicate more collinearity
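A sketch computing the VIFs directly from the equivalent form \(1 / (1 - R^2_{X_j | X_{-j}})\), where \(R^2_{X_j | X_{-j}}\) is the \(R^2\) from regressing \(X_j\) on the other predictors (the car package's vif() function reports the same values):
# sketch: variance inflation factors from their definition
adv <- readr::read_csv("Advertising.csv")

r2_tv    <- summary(lm(TV ~ radio + newspaper, data = adv))$r.squared
r2_radio <- summary(lm(radio ~ TV + newspaper, data = adv))$r.squared
r2_news  <- summary(lm(newspaper ~ TV + radio, data = adv))$r.squared

c(TV = 1 / (1 - r2_tv), radio = 1 / (1 - r2_radio), newspaper = 1 / (1 - r2_news))
# car::vif(lm(sales ~ TV + radio + newspaper, data = adv)) gives the same values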
Recall, linear regression is an example of a parametric approach because it assumes a linear functional form for \(f(X)\)
Parametric methods have several advantages; primarily, they are easy to fit
Parametric methods do have some disadvantages
they make strong assumptions about \(f(X)\)
Non-parametric methods do not explicitly assume a parametric form of \(f(X)\)
KNN (K-Nearest Neighbors Regression)
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\): \(\hat{f}(x_0) = \frac{1}{K}\sum_{x_i \in N_0}y_i\)
In general, the optimal value for \(K\) will depend on the bias-variance trade-off
A small value for \(K\) provides the most flexible fit, which will have low bias but high variance
A large value for \(K\) provides a smoother and less variable fit, however, the smoothing may cause bias by masking some of the structure in \(f(X)\)
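A minimal hand-rolled sketch of KNN regression with a single predictor and made-up data (packages such as FNN provide knn.reg() for real use; knn_reg below is just an illustrative helper that makes the averaging in the definition explicit):
# sketch: KNN regression implemented directly from the definition
knn_reg <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]   # indices of the K closest training observations
  mean(y_train[nbrs])                     # average the training responses in the neighborhood
}

set.seed(1)
x <- runif(50, 0, 10)
y <- sin(x) + rnorm(50, sd = 0.3)   # made-up nonlinear relationship

grid   <- seq(0, 10, by = 0.1)
fit_k1 <- sapply(grid, function(x0) knn_reg(x, y, x0, K = 1))
fit_k9 <- sapply(grid, function(x0) knn_reg(x, y, x0, K = 9))

plot(x, y)
lines(grid, fit_k1, col = "red")    # K = 1: very flexible, high variance
lines(grid, fit_k9, col = "blue")   # K = 9: smoother, less variable fit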
When will a parametric approach outperform a non-parametric approach?
The above provides an example of KNN regression on data with 50 observations
The true relationship is given by the black solid line
The blue curve corresponds to \(K = 1\) (on the left) and \(K = 9\) (on the right)
In this instance, the \(K=1\) predictions are far too variable, while the smoother \(K=9\) fit is much closer to \(f(X)\)
As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor
Curse of Dimensionality: as the number of predictors \(p\) grows, the training observations near any given test point \(x_0\) become sparse, so KNN and other local methods tend to perform poorly when \(p\) is large